Intro
Definition (mean): The sample mean of observed values $x_1, \ldots, x_n \in \mathbb{R}$ is
$$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i.$$
The word "sample" distinguishes it from the mean/expectation in probability theory.
Definition (median): The sample median of observed values is
$$\operatorname{med}(x_1, \ldots, x_n) = \begin{cases} x_{(n+1)/2}, & n \text{ odd}, \\ \tfrac{1}{2}\big(x_{n/2} + x_{n/2+1}\big), & n \text{ even}, \end{cases}$$
with $x_1 \leq x_2 \leq \ldots \leq x_n$ being the sorted data points.
If the number of data points is odd, take the single middle value; if it is even, take the average of the two middle values.
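A minimal Python sketch (using made-up data) that computes both statistics exactly as defined above, including the odd/even rule for the median:

```python
import numpy as np

def sample_mean(x):
    # arithmetic average of the observations
    return sum(x) / len(x)

def sample_median(x):
    # sort the data, then apply the odd/even rule from the definition
    s = sorted(x)
    n = len(s)
    if n % 2 == 1:                           # odd: the single middle value
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even: average of the two middle values

x = [3.1, -0.4, 2.2, 5.0, 1.7]               # hypothetical observations
print(sample_mean(x), sample_median(x))
# cross-check against numpy's built-ins
print(np.mean(x), np.median(x))
```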
Definition (Statistical model): A statistical model is a family $\mathcal{P}$ of probability distributions, regarded as the candidate distributions that could have generated the observed data.
Definition (Parameter): A (statistical) parameter of a statistical model $\mathcal{P}$ is a map $\gamma : \mathcal{P} \to \text{ some set } T$.
Examples:
- Mean/expectation
- Variance
- Correlations
Construction of Estimators
Definition (Estimator): An estimator is a function that maps data to estimates of quantities of interest.
Simply put, an estimator is a function whose input is the data; the quantity of interest one wants to estimate is called the estimand, and the values the function outputs are called estimates.
Plug-in Estimator
Definition (empirical distribution): The empirical distribution of $x_1, \ldots, x_n \in \mathbb{R}$ is the probability distribution $\hat{P}_n$ given by
$$\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i},$$
where $\delta_x$ denotes the point mass (Dirac measure) at $x$.
This is a discrete probability distribution.
Definition (empirical distribution function (ecdf)): The empirical distribution function (ecdf) of $x_1, \ldots, x_n$ is the distribution function of $\hat{P}_n$, which is
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i \leq t\}, \qquad t \in \mathbb{R}.$$
This is a step function defined on all of $\mathbb{R}$ (right-continuous, with jumps of size $1/n$ at the data points).
Theorem (Glivenko-Cantelli): If $X_1, X_2, \ldots$ are i.i.d. random variables with cdf (cumulative distribution function) $F$, then
$$\sup_{t \in \mathbb{R}} \big| \hat{F}_n(t) - F(t) \big| \longrightarrow 0 \quad \text{a.s. as } n \to \infty.$$
In other words, once we have enough samples, this estimator converges to the true distribution function.
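As an illustration (not part of the original notes), the following sketch estimates the sup-distance between the ecdf of standard normal samples and the true normal cdf for increasing $n$; the distance should shrink toward 0:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ecdf_sup_distance(n):
    # draw n i.i.d. N(0,1) samples and evaluate sup_t |F_n(t) - F(t)|
    x = np.sort(rng.standard_normal(n))
    # the sup is attained just before or at a jump of the ecdf,
    # so it suffices to check both sides of each jump point
    fn_right = np.arange(1, n + 1) / n
    fn_left = np.arange(0, n) / n
    f = norm.cdf(x)
    return max(np.max(np.abs(fn_right - f)), np.max(np.abs(fn_left - f)))

for n in [10, 100, 1000, 10000]:
    print(n, ecdf_sup_distance(n))
```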
Theorem: If $U \sim \mathrm{Unif}(0,1)$, then $X := F^{-1}(U)$ has cdf $F$, where $F^{-1}(u) = \inf\{x : F(x) \geq u\}$ is the (generalized) inverse, i.e., the quantile function.
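A quick sketch of this inverse-transform idea, using the exponential distribution as an assumed example (cdf $F(x) = 1 - e^{-x}$, so $F^{-1}(u) = -\log(1-u)$):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)        # U ~ Unif(0, 1)
x = -np.log(1.0 - u)                 # X = F^{-1}(U) for F(x) = 1 - exp(-x)

# the transformed samples should behave like Exp(1): mean 1, variance 1
print(x.mean(), x.var())
```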
Definition: The plug-in estimator of $\gamma(F)$ is the estimator $\hat{\gamma} = \gamma(\hat{F}_n)$.
Example:
Consider the mean $\gamma(F) = \int x \, dF(x)$. Then
$$\gamma(\hat{F}_n) = \int x \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n X_i = \bar{X}_n,$$
i.e., the plug-in estimator of the mean is the sample mean.
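As a sketch of my own, computing a plug-in estimate just means applying the functional to the empirical distribution, i.e., replacing integrals with respect to $F$ by averages over the data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # hypothetical sample

# plug-in estimate of the mean: integral of t dF_n(t) = average of the data
mean_hat = x.mean()

# plug-in estimate of the variance: integral of (t - mean)^2 dF_n(t)
var_hat = np.mean((x - mean_hat) ** 2)

# plug-in estimate of P(X <= 4): F_n(4), the fraction of data points <= 4
prob_hat = np.mean(x <= 4.0)

print(mean_hat, var_hat, prob_hat)
```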
M-Estimator
Definition (M-estimator): An estimator $\hat{\theta}(X_1, \ldots, X_n)$ maximizing a criterion function of the form
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n m_\theta(X_i),$$
where $m_\theta$ is a known function, is called an M-estimator (maximum-likelihood type).
Examples:
- For $\theta \in \mathbb{R}$, choosing $m_\theta(x) = -(x - \theta)^2$ yields the sample mean $\bar{X}_n$.
- Choosing $m_\theta(x) = -|x - \theta|$ yields the sample median. (Both are verified numerically in the sketch below.)
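A minimal numerical sketch (my own, with simulated data and scipy's scalar optimizer) that maximizes the two criterion functions above and compares the results with the sample mean and median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=201)      # hypothetical data

# M-estimation: maximize sum_i m_theta(x_i), i.e. minimize the negative criterion
def m_estimate(m_fn):
    res = minimize_scalar(lambda theta: -np.sum(m_fn(x, theta)),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

# m_theta(x) = -(x - theta)^2  ->  sample mean
theta_mean = m_estimate(lambda x, t: -(x - t) ** 2)
# m_theta(x) = -|x - theta|    ->  sample median
theta_median = m_estimate(lambda x, t: -np.abs(x - t))

print(theta_mean, np.mean(x))       # should agree
print(theta_median, np.median(x))   # should agree (up to optimizer tolerance)
```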
Method of Moments (MOM)
Given a parametric model for real-valued observations
$$X_1, \ldots, X_n \ \text{i.i.d.} \sim P_\theta, \qquad \theta \in \Theta \subseteq \mathbb{R}^k,$$
consider the moments
$$m_j(\theta) = \mathbb{E}_\theta[X_1^j], \qquad j = 1, \ldots, k.$$
If it exists, then the $j$-th moment may be estimated by the empirical moment
$$\hat{m}_j = \frac{1}{n} \sum_{i=1}^n X_i^j.$$
Definition: The MOM estimator $\hat{\theta}$ is the value of $\theta$ that solves the equation system
$$m_j(\hat{\theta}) = \hat{m}_j, \qquad j = 1, \ldots, k.$$
Example (Gaussian):
Suppose $P_\theta = \mathcal{N}(\mu, \sigma^2)$ with mean and variance unknown, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$.
The density:
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
The system of equations to solve:
$$m_1(\theta) = \mu = \frac{1}{n} \sum_{i=1}^n X_i, \qquad m_2(\theta) = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^n X_i^2.$$
Solving gives the sample mean and the empirical variance:
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
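A small numerical check (my own sketch, with invented true parameters) of these MOM formulas against simulated Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2_true = 1.5, 4.0
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=10_000)

# empirical moments
m1_hat = np.mean(x)        # estimates E[X] = mu
m2_hat = np.mean(x ** 2)   # estimates E[X^2] = mu^2 + sigma^2

# solve the moment equations: mu = m1_hat, mu^2 + sigma^2 = m2_hat
mu_hat = m1_hat
sigma2_hat = m2_hat - m1_hat ** 2   # equals the empirical variance

print(mu_hat, sigma2_hat)           # should be close to 1.5 and 4.0
```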
Maximum Likelihood Estimator (MLE)
Consider a parametric model for the observation
$$X \sim P_\theta, \qquad \theta \in \Theta.$$
Assume the model $\mathcal{P} = \{ P_\theta : \theta \in \Theta \}$ is dominated by a $\sigma$-finite measure $\nu$, i.e., $P_\theta \ll \nu$ for all $\theta \in \Theta$, and so we have densities
$$p_\theta = \frac{dP_\theta}{d\nu}, \qquad \theta \in \Theta.$$
(That such densities exist is a result from probability theory, namely the Radon-Nikodym theorem.)
Definition: The function $L_x(\theta) = p_\theta(x)$ is the likelihood function of model $\mathcal{P}$ for the data $x$.
If a density function is available, just plug it in directly as the likelihood.
Definition: The maximum likelihood estimate (MLE) of $\theta$ is
$$\hat{\theta}(x) = \underset{\theta \in \Theta}{\arg\max}\; L_x(\theta).$$
If $\hat{\theta}(X)$ is a measurable function of the observation $X$, then $\hat{\theta}(X)$ is called the maximum likelihood estimator (MLE) of $\theta$.
In practice, however, one usually works with the so-called log-likelihood function
$$l_x(\theta) = \log L_x(\theta).$$
This has two advantages:
- It avoids numerical underflow/overflow (see the sketch below);
- It is easier to work with. (For example, if $L_x$ is a product, then $l_x$ becomes a sum.)
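A short illustration (my own, with simulated data) of why the log transform matters numerically: the raw likelihood underflows to 0 for moderately many observations, while the log-likelihood stays a perfectly ordinary finite number.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=2000)

# raw likelihood: product of densities -> underflows to 0.0 in double precision
L = np.prod(norm.pdf(x, loc=0.0, scale=1.0))

# log-likelihood: sum of log-densities -> finite and easy to work with
l = np.sum(norm.logpdf(x, loc=0.0, scale=1.0))

print(L)   # 0.0 (underflow)
print(l)   # equals -n/2 * log(2*pi) - sum(x**2)/2
```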
Example (Gaussian):
Suppose $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, so $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0,\infty)$.
Assume $n \ge 2$, so that $\frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2 > 0$ a.s.
The log-likelihood function:
$$l_X(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.$$
Maximizing over $\mu$ and $\sigma^2$, it is straightforward to obtain:
$$\hat{\mu} = \bar{X}_n, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$
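A sketch (my own, with simulated data) that checks the closed-form MLE against a direct numerical maximization of the Gaussian log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(2.0, 3.0, size=500)
n = len(x)

def neg_log_lik(params):
    mu, log_sigma2 = params            # optimize log(sigma^2) to keep sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_num, sigma2_num = res.x[0], np.exp(res.x[1])

# closed-form MLE from the notes
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

print(mu_num, mu_hat)            # should agree
print(sigma2_num, sigma2_hat)    # should agree
```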
Bayes Estimators
In the constructions so far, the estimates depend only on the observed data and are not influenced by any prior knowledge or experience of ours.
We now want to change this: the resulting estimates should depend both on the data and on our prior beliefs.
The workflow of Bayesian inference:
1. Treat the previously fixed parameter $\theta$ as a random variable and choose a prior distribution for it (our belief before observing the data).
2. Treat $P_\theta$ as the conditional distribution of $X$ given $\theta$.
3. After observing the data $x$, base statistical inference on the posterior distribution of $\theta$, i.e., the conditional distribution of $\theta$ given $X = x$.
Consider an observation modeled as $X\sim P_\theta, \theta \in \Theta \subseteq \mathbb{R}.$
Theorem (Bayes theorem): Suppose the prior distribution has density $\pi$ w.r.t. a measure $\nu$ and
$P_\theta \ll \nu \ \ \forall \theta$ with densities $p_\theta(x) = p(x \mid \theta)$.
Then the posterior distribution has density (w.r.t. $\nu$):
$$\pi(\theta \mid x) = \frac{p(x \mid \theta)\, \pi(\theta)}{p(x)},$$
where
$$p(x) = \int_\Theta p(x \mid \theta)\, \pi(\theta)\, \nu(d\theta)$$
is the prior predictive density of $X$.
Bayes estimators of $\theta$ are obtained as characteristics of the posterior distribution.
Most frequently, one considers the posterior mean:
$$\hat{\theta}_{\mathrm{Bayes}}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \theta \, \pi(\theta \mid x)\, \nu(d\theta).$$
Example (Gaussian):
Assume $X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. We select as prior distribution $\mu \sim \mathcal{N}(m, \tau^2)$, so
$$\pi(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left( -\frac{(\mu - m)^2}{2\tau^2} \right).$$
The likelihood function is equal to
$$L_X(\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right).$$
The posterior density is
$$p(\mu \mid X) \propto L_X(\mu)\, \pi(\mu) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2 - \frac{(\mu - m)^2}{2\tau^2} \right).$$
We recognize that the posterior distribution will be a normal distribution.
More precisely,
$$p(\mu \mid X) \propto \exp\left( -a\mu^2 + 2b\mu \right),$$
where
$$a = \frac{n}{2\sigma^2} + \frac{1}{2\tau^2}, \qquad b = \frac{n \bar{X}_n}{2\sigma^2} + \frac{m}{2\tau^2}.$$
We conclude that since $p(\mu \mid X) \propto \exp\{ -a \, (\mu - b/a)^2 \}$,
it holds that $p(\mu \mid X)$ is the density of a normal distribution with mean and variance
$$\mathbb{E}[\mu \mid X] = \frac{b}{a} = \frac{n\tau^2 \bar{X}_n + \sigma^2 m}{n\tau^2 + \sigma^2}, \qquad \operatorname{Var}[\mu \mid X] = \frac{1}{2a} = \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2}.$$
The posterior mean is a convex combination of $\overline{X}_n$ and the prior mean $m$.
If we set
$$\lambda = \frac{n\tau^2}{n\tau^2 + \sigma^2} \in (0,1),$$
then
$$\mathbb{E}[\mu \mid X] = \lambda \, \overline{X}_n + (1 - \lambda)\, m.$$
Note that as $n \to \infty$, $\lambda \to 1$. That is, the larger $n$ is, the more the posterior mean is determined by the data; conversely, the smaller $n$ is, the more it is determined by our prior knowledge.
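A small sketch (my own, with invented prior hyperparameters and simulated data) that computes the posterior mean and variance from the formulas above and shows the weight $\lambda$ approaching 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma2 = 2.0, 4.0        # sigma^2 is assumed known
m, tau2 = 0.0, 1.0                # prior: mu ~ N(m, tau^2)

for n in [5, 50, 500, 5000]:
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    xbar = x.mean()

    lam = n * tau2 / (n * tau2 + sigma2)              # weight on the data
    post_mean = lam * xbar + (1 - lam) * m            # posterior mean
    post_var = sigma2 * tau2 / (n * tau2 + sigma2)    # posterior variance

    print(n, round(lam, 4), round(post_mean, 4), round(post_var, 6))
```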
Mean Square Error, Bias and Variance
Definition: The mean square error is defined as
$$\mathrm{MSE}_\theta[\hat{\theta}] := \mathbb{E}_\theta\big[ (\hat{\theta} - \theta)^2 \big].$$
Theorem: The mean square error decomposes as
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2,$$
where $\mathrm{Bias}_\theta[\hat{\theta}] := \mathbb{E}_\theta[\hat{\theta}] - \theta$ is the bias of $\hat{\theta}$.
Proof:
Write
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\Big[ \big( \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] + \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2 \Big]$$
and expand:
$$\mathrm{MSE}_\theta[\hat{\theta}] = \mathbb{E}_\theta\big[ (\hat{\theta} - \mathbb{E}_\theta[\hat{\theta}])^2 \big] + 2\,\big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big)\, \underbrace{\mathbb{E}_\theta\big[ \hat{\theta} - \mathbb{E}_\theta[\hat{\theta}] \big]}_{=0} + \big( \mathbb{E}_\theta[\hat{\theta}] - \theta \big)^2 = \mathrm{Var}_\theta[\hat{\theta}] + \mathrm{Bias}_\theta[\hat{\theta}]^2.$$
$\square$
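As a quick sanity check (not part of the original notes), a Monte Carlo experiment can verify the decomposition numerically, e.g., for the biased variance estimator $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma2 = 10, 4.0
reps = 200_000

# estimator: hat sigma^2 = (1/n) * sum (X_i - Xbar)^2 (the biased MLE version)
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

mse = np.mean((est - sigma2) ** 2)
bias = np.mean(est) - sigma2          # theory: -sigma^2 / n
var = np.var(est)

print(mse, var + bias ** 2)           # the two numbers should nearly coincide
```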